HOLOGRAPHIC IMPLEMENTATIONS OF NEURAL NETWORKS

I.1 INTRODUCTION

One  of  the  attractive  features  of  neural  computation  is  the  fact  that  neural  algorithms 
can  be  mapped  relatively  easily  onto  analog  hardware.  The  use  of  simple  analog  devices 
allows  for  high  parallelism  in  neural  hardware  and  thus  for  gains  in  processing  power. 
Analog  VLSI  and  optics  are  the  two  technologies  under  development  for  implementations 
of  artificial  neural  networks.  The  advantages  of  VLSI  derive  from  the  maturity  of  silicon 
device  fabrication  technology  and  the  sophistication  of  nonlinear  semiconductor  devices. 
The advantage of optical implementations is that three-dimensional linear interconnections
may be formed and modified relatively easily using optical holography. This is in
contrast with the constraints on VLSI, which confine integrated networks to two dimensions.
For neural network models with a large number of connections per neuron, the
area of a VLSI implementation is dominated by the area of the channels which interconnect
the processing nodes (the neurons). Optical implementations are typically arranged
as shown in Fig.I.1, with planar arrays of neurons interconnected externally to the plane.
This  architecture  permits  the  area  of  the  plane  to  be  fully  populated  with  active  devices, 
allowing  the  construction  of  much  larger  networks.  The  disparity  between  optical  and 
electronic  implementations  in  terms  of  the  number  of  neurons  per  unit  area  which  may  be 
realized  depends  on  the  density  with  which  the  neurons  are  to  be  interconnected  and  the 
functionality  of  the  neurons. 

In this chapter we consider networks whose connections are dense, i.e. each neuron
is connected to many others, and irregular, i.e. the strengths of different connections
are different. Each neuron is assumed to perform a simple threshold on a weighted sum of
the activations of the other neurons to which it is connected. In this case it is not necessary
to implement one-to-one connections between any two units. Instead, a single “bus”
can be used for each neuron that collects the signals from all the units in its receptive
field and delivers the accumulated sum to the neuron. This fact dramatically reduces the
complexity of the hardware needed to perform the interconnections for both the optical
and electronic implementations of neural networks. The simplest VLSI implementation of
this architecture is a crossbar which connects $M$ neurons in area proportional to $M^2$ [1].
In this chapter we examine the optical implementation of analog summing interconnections
and we derive basic relationships between the number of neurons per unit area at the
“neural planes” and the properties of the optical system that is used to perform the
connections. In the optical implementations the input port to a “neuron” is a light detector
and the output port is an adjacent light source or modulator that is electrically controlled
by the detected signal. The weighted interconnections between the neurons are realized via
holograms that are placed between the planes. While our discussion is based on a specific
architecture (the Vander Lugt correlator) as an example, the limits we derive and the basic
methods we describe are generally applicable with only minor modifications.

I.2 OPTICAL INTERCONNECTIONS USING PLANAR HOLOGRAMS

A schematic diagram of an optical correlator [2] is shown in Fig.I.2. We will utilize
this same basic architecture throughout the chapter and consider its implementation with
a planar hologram in this section and a volume hologram in the next. A point at the
input plane ($P_1$ in Fig.I.2) is connected to an output point $P_2$ as follows. The first lens
$L_1$ collimates the light emanating from $P_1$ into a single plane wave that illuminates the
hologram. The direction of propagation of this plane wave has a one-to-one correspondence
with the position of $P_1$ at the input plane. A hologram is placed at the intermediate plane
in Fig.I.2. Its purpose is to diffract the incident light towards points at the output plane
and thus interconnect input points to output points. We can think of the hologram as a
linear superposition of sinusoidal gratings. Each grating diffracts a portion of the incident
wave into another plane wave propagating towards the output plane. The difference between
the direction of propagation of the incident wave and the direction of the diffracted wave
is determined by the spatial frequency and the orientation of the fringes of each grating.
The second lens ($L_2$ in Fig.I.2) converts each of the diffracted waves into a focused spot
whose position at the output plane corresponds to the direction of propagation of the
diffracted beam. In this manner, each sinusoidal grating that is recorded on the hologram
interconnects $P_1$ to an output point. The weight of the connection is determined by the
strength of the recorded grating.

The system of Fig.I.2 is shift invariant. Once the connectivity of a pair of input-output
points is determined by recording the appropriate grating on the hologram, then any other
input point is connected in the same way to the point at the output plane that is shifted
from the original output point by a distance equal to the separation between the two input
points. If such a set of four points were selected for the placement of neurons at the input
and output planes, then it would not be possible to arbitrarily specify the connectivity
between the neurons in the system of Fig.I.2. The strategy that we use in order to provide
independent  interconnections  between  input  and  output  points  is  as  follows.  Once  an 
input/output  pair  is  selected  and  a  grating  is  recorded  for  it,  then  for  each  additional 
input  location  that  is  used  for  the  placement  of  a  neuron,  the  point  that  is  shifted  by  the 
same  amount  at  the  output  is  excluded  from  being  used  as  a  neuron  site.  Similarly,  a  point 
at  the  input  is  eliminated  for  each  additional  output  point  that  is  used. 

This procedure is schematically drawn in Fig.I.3, where the 2-D rectangular grids of
available input (top) and output (bottom) pixels are drawn. A grating that connects two
points is drawn as a solid arrow in this diagram. The use of a second point automatically
connects it to the point at the output marked with an X (dotted arrow) and it is
therefore eliminated from the output grid. An analogous diagram is drawn for the use of
an additional output point. In general, with reference to Fig.I.3, each grating recorded in
the hologram specifies an interconnection between only one pair of input/output points if
and only if the diagram formed by connecting any two input neurons and any two output
neurons cannot be a parallelogram. We now use this criterion to address two issues: (a)
capacity, or the maximum number of pixels at the input and output planes that can be
used for the placement of neurons, and (b) the derivation of appropriate sampling grids
that provide this maximum capacity.

Let us denote by $N_1$ ($N_2$) the number of input (output) neurons and let $N$ be the
number of available pixels in 1-D at the input and output planes. The total number of
connections that need to be implemented is $N_1 N_2$, and each of these connections must be
realized by a distinct grating in order to be independently specifiable. In the diagram
of Fig.I.3, each distinct grating corresponds to a vector of a given length and direction.
The maximum number of distinct vectors (i.e. each vector having different length or
orientation from all others) that can be drawn in the diagram of Fig.I.3 provides us with
an upper bound for the number of independent interconnections. We can count how many
such vectors there are relatively easily. Pick the point at the lower left corner at the output
in Fig.I.3. $N^2$ distinct vectors can be drawn from this point to points at the input. If
we pick any one of the other three output corner points, then each of the $N^2$ vectors
that can be drawn connecting it to points at the input is different, except for vectors
connecting to points at the perimeter of the input plane. Subtracting these overcounted
vectors gives us $4N^2 - 4N + 1$ for the number of distinct vectors. The order of magnitude
of the interconnection capacity of this system is therefore

$$N_1 N_2 \le N^2. \qquad (I.1)$$
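
The count of distinct vectors can be checked by brute force on small grids. The following Python sketch is illustrative only: the function name `count_distinct_vectors` is not from the original text, and pixels are assumed to sit on integer coordinates 0..N-1 in both planes.

```python
from itertools import product

def count_distinct_vectors(n):
    """Count distinct displacement vectors between two n x n pixel grids.

    Each vector runs from an input pixel (px, py) to an output pixel
    (qx, qy); two vectors are the same grating if they have the same
    length and direction, i.e. the same (dx, dy).
    """
    pixels = list(product(range(n), repeat=2))
    vectors = {(qx - px, qy - py) for (px, py) in pixels for (qx, qy) in pixels}
    return len(vectors)

# Compare the brute-force count with the closed form 4N^2 - 4N + 1.
for n in (2, 3, 5, 8):
    print(n, count_distinct_vectors(n), 4 * n * n - 4 * n + 1)
```

Each displacement component ranges over the $2N - 1$ values from $-(N-1)$ to $N-1$, so the count is $(2N-1)^2 = 4N^2 - 4N + 1$, in agreement with the corner-counting argument.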

For example, let $N_1 = N_2 = 10^4$. Then from Eq.(I.1) we conclude that in order to implement
this network we must construct an optical system that is capable of accommodating
$N = 10^4$ pixels in 1-D. This applies not only to the input and output planes but also to the
hologram, which must have resolution equal to $N^2$ pixels as well. Notice that the input
and output planes are sparsely populated with neurons, since only $10^4$ out of the available
$10^8$ pixels are used.

We now describe specific methods for selecting which $N_1$ ($N_2$) pixels out of the available
$N^2$ pixels at the input (output) plane to use. This selection can be systematically
accomplished in several ways and the resulting sampling grids are not unique. One such
pair of sampling grids is shown in Fig.I.4a. At the input a cluster of $N_1 = \sqrt{N} \times \sqrt{N}$
neurons is used, whereas at the output the neurons are arranged on a periodic grid with
period $\sqrt{N} + 1$. The total number of neurons that can be accommodated at the output is
$N$ also. We prove that this is a valid sampling grid by showing that it is impossible to draw
a parallelogram on the diagram of Fig.I.4a by connecting any two input points and any
two output points. Such a parallelogram cannot be formed because the edge connecting
two points on the sampling grid at the input plane would be shorter than $a(\sqrt{N} + 1)$,
whereas the edge parallel to it at the output plane would have to be equal to or longer
than $a(\sqrt{N} + 1)$. Here $a$ is a constant that is determined by the orientation of these two edges.

A different sampling grid is shown in Fig.I.4b. The output grid is the same as in the
previous case but the input is sampled with period $\sqrt{N}$. We again use the parallelogram
test to show that this is a valid sampling grid. The edge of such a parallelogram connecting
two input points would have length $a k_1 \sqrt{N}$, with $k_1$ an integer in the range $0 < k_1 < \sqrt{N}$.
The edge of this same parallelogram at the output plane would have length $a k_2 (\sqrt{N} + 1)$,
with $k_2$ an integer in the range $0 < k_2 < \sqrt{N}$. The smallest pair of integers that can
make the two edges equal is $k_1 = \sqrt{N} + 1$, $k_2 = \sqrt{N}$, which is beyond the range of $k_1$.
Therefore, it is not possible to draw a parallelogram in Fig.I.4b, which proves the validity
of these sampling grids.
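
The parallelogram test can also be run numerically. The sketch below is a minimal check under stated assumptions: $N$ is a perfect square, pixel coordinates start at 0, and the helper names (`grid`, `is_valid_planar`, `grids_fig_4a`, `grids_fig_4b`) are hypothetical. It uses the fact that a parallelogram exists exactly when two different input/output pairs share the same displacement vector.

```python
from itertools import product
from math import isqrt

def grid(positions):
    """2-D sampling grid built from a list of 1-D pixel positions."""
    return list(product(positions, repeat=2))

def is_valid_planar(inputs, outputs):
    """True if every input/output pair has its own displacement vector,
    i.e. no two input points and two output points form a parallelogram."""
    seen = set()
    for (px, py) in inputs:
        for (qx, qy) in outputs:
            v = (qx - px, qy - py)
            if v in seen:
                return False
            seen.add(v)
    return True

def grids_fig_4a(n):
    """Input: sqrt(N) x sqrt(N) cluster. Output: period sqrt(N) + 1."""
    r = isqrt(n)
    return grid(range(r)), grid(range(0, n, r + 1))

def grids_fig_4b(n):
    """Input: period sqrt(N). Output: period sqrt(N) + 1."""
    r = isqrt(n)
    return grid(range(0, n, r)), grid(range(0, n, r + 1))

for n in (16, 25):
    for name, make in (("4a", grids_fig_4a), ("4b", grids_fig_4b)):
        inputs, outputs = make(n)
        print(n, name, len(inputs), len(outputs), is_valid_planar(inputs, outputs))

# A fully populated pair of planes fails the test.
dense = grid(range(4))
print("dense", is_valid_planar(dense, dense))
```

For both grid pairs each plane holds $N$ neurons, so the $N_1 N_2 = N^2$ connections all receive distinct gratings.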

I.3 OPTICAL INTERCONNECTIONS USING VOLUME HOLOGRAMS

We now consider the interconnecting capabilities of the system in Fig.I.2 with a volume
rather than a planar hologram in the intermediate plane [3,4,5]. The distinction in
the mode of operation between a planar and a volume hologram is the sensitivity of the
volume hologram to the angle of incidence of the illumination. We will discuss the angular
sensitivity of volume holograms with the help of the k-space diagram of Fig.I.5. The
k-space representation is a sphere with radius $2\pi/\lambda$ in which the incident plane wave is
drawn as a vector with its origin at the center of the sphere, magnitude equal to $2\pi/\lambda$,
and direction that of the incident plane wave. $\lambda$ is the wavelength of the incident light.
The grating is drawn as a vector with its origin at the tip of the incident vector, magnitude
equal to $2\pi/\Lambda$, and direction perpendicular to the fringes of the grating. $\Lambda$ is the
period of the grating. The diffracted optical wave is drawn as a vector with origin at the
center of the sphere and magnitude $2\pi/\lambda$. The direction of the diffracted wave is taken
to be towards the tip of the grating vector. The efficiency with which light is diffracted
is determined by the difference between this diffracted wavevector and the vector formed
as the sum of the incident and grating vectors [6]. If the tip of the grating vector falls on
the sphere, then this difference reduces to zero and the efficiency is maximized (this is the
Bragg condition). For a finite difference the diffraction efficiency is reduced in proportion
to the square of the thickness of the crystal, i.e. a thicker crystal is more sensitive to an
angular deviation from the Bragg condition.
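
The k-space construction can be illustrated with a small numerical sketch. It works in a 2-D slice of the k-sphere, and the function names (`wavevector`, `grating_for`, `bragg_mismatch`) and the angle convention (angles measured from the optical axis) are assumptions made for illustration. It only evaluates the geometric mismatch between the sum of the incident and grating vectors and the sphere; it does not model the diffraction efficiency itself.

```python
import math

def wavevector(theta, wavelength):
    """Plane-wave vector of magnitude 2*pi/wavelength at angle theta (radians)."""
    k = 2 * math.pi / wavelength
    return (k * math.sin(theta), k * math.cos(theta))

def grating_for(theta_in, theta_out, wavelength):
    """Grating vector K that Bragg-matches theta_in to theta_out: K = k_out - k_in."""
    k_in = wavevector(theta_in, wavelength)
    k_out = wavevector(theta_out, wavelength)
    return (k_out[0] - k_in[0], k_out[1] - k_in[1])

def bragg_mismatch(theta_in, grating, wavelength):
    """Distance of the tip of (k_in + K) from the k-sphere; zero at the Bragg condition."""
    k_in = wavevector(theta_in, wavelength)
    tip = (k_in[0] + grating[0], k_in[1] + grating[1])
    return math.hypot(tip[0], tip[1]) - 2 * math.pi / wavelength

wavelength = 1.0                       # arbitrary length unit
K = grating_for(-0.2, 0.2, wavelength)
period = 2 * math.pi / math.hypot(K[0], K[1])   # grating period from |K| = 2*pi/period
print("grating period:", period)
for delta in (0.0, 0.01, 0.05):
    print(delta, bragg_mismatch(-0.2 + delta, K, wavelength))
```

The mismatch vanishes at the recording angle and grows as the incident direction is rotated in this slice, which is the angular selectivity that a planar hologram lacks.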

Returning to Fig.I.2, imagine that a pair of input/output points and a grating have
been chosen such that light originating at the input point produces a plane wave that
illuminates the hologram at the Bragg angle and the diffracted light is focused at the selected
output point. This situation is drawn in the k-space diagram (Fig.I.5) with the diffracted
vector being the vectorial sum of the incident and the grating vectors. Consider the two
circles that are drawn in Fig.I.5. These circles are formed by the intersection of the
k-space sphere with two planes, both of them perpendicular to the grating vector. One of the
planes contains the origin and the second contains the tip of the grating vector. Consider
an additional incident vector drawn on the k-sphere such that its tip lies on the bottom
circle. The grating that is recorded to interconnect the first two neurons is perfectly matched
to this additional vector. The direction of the diffracted wave is found by forming the
vectorial sum of the additional wavevector and the original grating vector. The tip of the
new diffracted wavevector falls on the upper circle. All such incident and diffracted waves
define a “degeneracy cone” in k-space along which a single grating specifies the connections
of all incident wavevectors that lie on the bottom circle to corresponding diffracted
wavevectors on the upper circle. In order to implement independent interconnections in
this case, the locations of the neurons at the input and output planes must be chosen such
that no two input/output pairs share the same degeneracy cone. This condition can be
mapped to the input and output planes as shown in Fig.I.6. The grating is drawn as a
vector connecting a point at the input to a point at the output, and the two circles are
approximately mapped to lines perpendicular to the grating vector. In this diagram, the
condition that must be obeyed in choosing the location of the neurons is that the diagram
that is formed by connecting any two input neurons and any two output neurons cannot be
a rectangle. As before, we derive the capacity of the correlator implemented with a volume
hologram and then present specific algorithms for deriving valid sampling grids.

We can derive an upper bound for the number of independent interconnections that
can be implemented with a volume holographic correlator by starting with the connections
that the system with the planar hologram can implement and then counting the additional
connections that are created by the volume hologram. Each distinct vector that can be
drawn in Fig.I.6 connecting an input to an output pixel can be used to perform an independent
interconnection. From our discussion of planar holograms we know that there
are $4N^2 - 4N + 1$ such vectors. With a volume hologram, however, each such vector can be
used multiple times because when it is translated along the direction of the vector, it
is no longer possible to form a rectangle using the origins and the tips of the original and
translated vectors. The maximum number of truly distinct translations we can have for
each vector is upper bounded by $N$, the number of pixels that are available in one dimension.
Thus, we obtain the upper bound for the number of independent interconnections
that can be implemented with a volume hologram as $(4N^2 - 4N + 1) \times N$, or more simply
the order of magnitude is

$$N_1 N_2 \le N^3. \qquad (I.2)$$


As an example, if $N_1 = N_2 = 10^4$ then using Eq.(I.2) we find $N \ge 465$. Notice that this
requirement on the space-bandwidth product of the input plane and the optical system
is reduced greatly compared to the planar hologram case. As a result it is possible to
construct much more compact systems when volume holograms are used. Another way
of looking at the distinction between the planar and volume holograms is in terms of the
density with which they allow us to populate the input and output planes with neurons.
For the symmetric case ($N_1 = N_2$) the number of neurons that can be accommodated by a
plane of fixed space-bandwidth product increases by a factor $\sqrt{N}$ when a volume hologram
is used.
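
The two capacity bounds can be compared with a few lines of arithmetic. This is a minimal sketch with hypothetical function names; it simply finds the smallest integer $N$ whose square (planar hologram) or cube (volume hologram) is at least $N_1 N_2$, and it ignores the order-of-magnitude constants dropped in the bounds.

```python
from math import isqrt

def min_pixels_planar(n1, n2):
    """Smallest integer N with N**2 >= n1 * n2 (planar hologram bound)."""
    c = n1 * n2
    n = isqrt(c)
    return n if n * n == c else n + 1

def min_pixels_volume(n1, n2):
    """Smallest integer N with N**3 >= n1 * n2 (volume hologram bound)."""
    c = n1 * n2
    n = max(1, round(c ** (1 / 3)))
    while n ** 3 < c:                       # correct a floating-point underestimate
        n += 1
    while n > 1 and (n - 1) ** 3 >= c:      # correct a floating-point overestimate
        n -= 1
    return n

n1 = n2 = 10 ** 4
print("planar:", min_pixels_planar(n1, n2))   # 1-D pixels needed with a planar hologram
print("volume:", min_pixels_volume(n1, n2))   # 1-D pixels needed with a volume hologram
```

For $N_1 = N_2 = 10^4$ the planar case needs $N = 10^4$ pixels in 1-D, while the volume case needs only $N = 465$, since $464^3 < 10^8 \le 465^3$.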

We now discuss methods for deriving specific sampling grids that achieve the bound
of Eq.(I.2). The design criterion that is used in selecting the locations for the placement
of neurons at the input and output planes is the avoidance of the formation of a rectangle
in the diagram of Fig.I.6, as discussed earlier. The sampling grid shown in Fig.I.7a is
constructed by selecting $\sqrt{N}$ adjacent columns, each having $N$ neurons, as the input
pattern. The maximum separation in the horizontal direction is $\sqrt{N} - 1$ pixels. If at
the output plane any two neurons are separated by less than $\sqrt{N}$ pixels in the horizontal
direction, then the possibility exists that a rectangle can be formed at some angle using
these two output points and two of the input points. In Fig.I.7a this possibility is eliminated
since the output sampling grid consists of $\sqrt{N}$ columns that are separated by $\sqrt{N}$ pixels.
Notice that both the input and output planes contain in this case $N^{3/2}$ neurons. A second
possibility is shown in Fig.I.7b. In this case the output pattern is the same as the one in
Fig.I.7a, whereas the input pattern is constructed from $\sqrt{N}$ columns, each being separated
from the adjacent column by $\sqrt{N} + 1$ pixels. When we attempt to draw a parallelogram
using two input and two output points on the sampling grids of Fig.I.7b, we find that
this can only be accomplished if the length of the edge that connects the two input points
($a k_1 (\sqrt{N} + 1)$) and the length of the one that connects the two output points ($a k_2 \sqrt{N}$)
are equal. $k_1$ and $k_2$ are integers and $a$ is a constant. The smallest integers that
satisfy this equation are $k_1 = \sqrt{N}$, $k_2 = \sqrt{N} + 1$, which both yield an edge that is larger
than $N$. Hence, it is not possible to form the rectangle within the available $N \times N$ input
and output planes, and the sampling grids of Fig.I.7b are shown to be valid.

Notice that the number of neurons in either plane for the two sampling grids we
presented is $N \times \sqrt{N} = N^{3/2}$. Equivalently, we can think of them as patterns with fractal
dimension 3/2. The total number of connections that are implemented by the hologram
is $N_1 N_2 = N^3$. Comparing this result with Eq.(I.2) we find that these sampling grids
provide the full interconnection capacity that is available with a volume hologram.
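
The rectangle criterion can be explored with a brute-force sketch. Several things here are assumptions rather than part of the original: $N$ is a perfect square, the output plane is drawn directly below the input plane with a vertical shift `offset` of at least $N$ pixels (one reading of the plane diagram), and the helper names are hypothetical. Two inputs $p, p'$ and two outputs $q, q'$ are taken to form a rectangle when $q' - q = p' - p$ and this edge is perpendicular to the grating vector drawn from $p$ to $q$.

```python
from math import isqrt

def columns(xs, n):
    """All pixels (x, y) lying in the given columns of a plane with n rows."""
    return [(x, y) for x in xs for y in range(n)]

def has_degenerate_rectangle(inputs, outputs, offset):
    """True if some pair of inputs and pair of outputs forms a rectangle.

    The output plane is assumed to be drawn `offset` pixels below the
    input plane, so the grating vector from p to q is
    (qx - px, (qy - offset) - py).
    """
    out_set = set(outputs)
    for (px, py) in inputs:
        for (p2x, p2y) in inputs:
            dx, dy = p2x - px, p2y - py
            if dx == 0 and dy == 0:
                continue
            for (qx, qy) in outputs:
                if (qx + dx, qy + dy) not in out_set:
                    continue                      # no parallel output edge
                gx, gy = qx - px, (qy - offset) - py
                if dx * gx + dy * gy == 0:        # edge perpendicular to grating
                    return True
    return False

def grids_fig_7a(n):
    """Input: sqrt(N) adjacent columns. Output: columns spaced by sqrt(N)."""
    r = isqrt(n)
    return columns(range(r), n), columns(range(0, n, r), n)

def grids_fig_7b(n):
    """Input: columns spaced by sqrt(N) + 1. Output: columns spaced by sqrt(N)."""
    r = isqrt(n)
    return columns(range(0, n, r + 1), n), columns(range(0, n, r), n)

n = 9
for name, make in (("7a", grids_fig_7a), ("7b", grids_fig_7b)):
    inputs, outputs = make(n)
    print(name, len(inputs), len(outputs),
          has_degenerate_rectangle(inputs, outputs, offset=n))
```

Under these assumptions both grid pairs hold $N^{3/2}$ neurons per plane and contain no rectangle, whereas fully populated planes do.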

I.4 CONCLUSION

We described how holograms can be used to provide arbitrary, full interconnection
between two planes of neurons. The methods presented can be extended in relatively
straightforward ways to design other sampling grids and to realize non-symmetric (i.e.
$N_1 \neq N_2$) local interconnections [7]. All such sampling grids share the property that
the available degrees of freedom of the hologram are fully utilized. In the case of planar
holograms there are $N^2$ pixels available in the area of the hologram, whereas 3-D storage
in volume holograms increases the capacity to $N^3$. The overall volume required is in
both cases proportional to $N^3$ (obtained as the product of the area of each plane, which
is proportional to $N^2$, and the minimum separation between planes, which is proportional
to $N$). In Table 1 we compare a planar versus a thick hologram in terms of the size of
the optical system required to fully interconnect two layers each having $N_1 = N_2 = M$
neurons. The required overall volume of the system is $M$ times smaller when a volume
hologram is used. It should be pointed out, however, that the reduction in system volume
that results from the 3-D storage capability of volume holograms is accompanied by a
reduction in the degree of control we have in storing information. This is due to the fact
that while information is stored throughout the three-dimensional medium in a volume
hologram, we can only affect the stored contents through information that we specify on
the two-dimensional surface that encloses the hologram [8]. The consequences of this fact
[9] must be included along with the geometrical arguments presented here for a complete
assessment of the relative merits of the two types of holographic interconnections.


Table 1

PLANAR VS. VOLUME HOLOGRAMS

M = Number of Neurons
N = 1-D Space-Bandwidth Product

                          2-D (planar)    3-D (volume)
Linear Dimension (N)      $M$             $M^{2/3}$
Area                      $M^2$           $M^{4/3}$
Total System Volume       $M^3$           $M^2$

Volume Ratio: $R = V_{2\text{-}D} / V_{3\text{-}D} = M$

Fig.I.1 Basic optical processing system.

Fig.I.2 Vander Lugt correlator.

Fig.I.3 For 2-D holograms, no figure with vertices at a pair of input neurons and a pair of
output neurons may form a parallelogram. (Panel labels: INPUT PLANE, OUTPUT PLANE.)


LOWER BOUND FOR CONNECTIVITY
IN  LOCAL-LEARNING  NEURAL  NETWORKS 

II.1 INTRODUCTION

Learning by example has emerged as the most important question in neural networks.
Clearly, a given neural network cannot just learn any function; there must be some
restrictions on which networks can learn which functions. One obvious restriction, which
is independent of the learning aspect, is that the network must be big enough to accommodate
the circuit complexity of the function it will eventually simulate. A restriction
that arises merely from the fact that the network is expected to learn the function, rather
than being purposely designed for the function, is reported in [Abu-Mostafa, 1988]. The
restriction imposes a lower bound on the connectivity of the network (number of synapses
per neuron). In this paper, we describe a generalization of this result by removing one of
the requirements on the learning mechanism. Instead of requiring that the training sample
itself be loaded directly into the neurons, we now allow arbitrary features to be extracted
from the sample and loaded into the neurons. This also implies that the number of neurons
can be very large with respect to the number of bits in each sample.

However, our generalized result still assumes a local-learning mechanism. The local-learning
assumption allows only local information to be used by each neuron in its learning
effort.  The  assumption  cannot  be  completely  removed  since  a  powerful  learning  mechanism 
can  be  designed  that  will  find  one  of  the  low-connectivity  (e.g.,  two-input-NAND-gate) 
circuits  that  fits  all  the  training  samples,  perhaps  by  exhaustive  search.  Local-learning  is  a 
strong  assumption  that  excludes  sophisticated  learning  mechanisms  used  in  neural-network 
models. 

The  lower  bound  on  the  connectivity  of  the  network  is  given  in  terms  of  the  entropy  of 
the  environment  that  provides  the  training  samples.  Entropy  is  a  quantitative  measure  of 
the  disorder  or  randomness  in  an  environment  or,  equivalently,  the  amount  of  information 
needed  to  specify  the  environment.  In  section  2,  we  shall  introduce  the  formal  definitions 
and  results,  but  we  start  here  with  an  informal  exposition  of  the  ideas  involved. 

The environment in our model produces patterns represented by $N$ bits $x = x_1 \cdots x_N$
(pixels in the picture of a visual scene, if you will). Only $h$ different patterns can be
generated by a given environment, where $h \le 2^N$ (the entropy is essentially $\log_2 h$). No
knowledge is assumed about which patterns the environment can generate, only that there
are $h$ of them. In the learning process, a number of sample patterns are generated at
random from the environment. A large number of binary features are extracted from each
sample and input to the network, one feature per neuron. The network uses this information
to set its internal parameters and gradually tune itself to this particular environment.
Because of the network architecture, each neuron knows only its own bit and the bits of
the neurons it is directly connected to by a synapse. Hence, the learning rules are local: a
neuron does not have the benefit of the entire global pattern that is being learned.

After the learning process has taken place, each neuron is ready to perform a function
defined  by  what  it  has  learned.  The  collective  interaction  of  the  functions  of  the  neurons 
is  what  defines  the  overall  function  of  the  network.  The  main  result  of  this  paper  is 
that  (roughly  speaking)  if  the  connectivity  of  the  network  is  less  than  the  entropy  of  the 
environment,  the  network  cannot  learn  about  the  environment.  The  idea  of  the  proof  is 
to  show  that  if  the  connectivity  is  small,  the  final  function  of  each  neuron  is  independent 
of  the  environment,  and  hence  to  conclude  that  the  overall  network  has  accumulated  no 
information  about  the  environment  it  is  supposed  to  learn  about. 

II.2 LOCAL-LEARNING NETWORKS

A neural network can be described as an undirected graph (the vertices are the neurons
and the edges are the synapses). Label the neurons $1, \cdots, M$. Each neuron can store one
bit at a time, but it also has access to those bits stored by the other neurons to which
it is directly connected by a synapse. By local learning, we mean that the adjustments a
neuron makes when the network is loaded with a training sample will depend only on the
bits it has access to, namely its own bit and the bits of its neighbors. In other words, the
neuron does not have the benefit of the global picture in its effort to learn, just the bits it
can see locally.

During the learning phase, an unknown environment provides a sequence of training
samples to the network. The environment is a subset $e \subseteq \{0,1\}^N$ (each $x \in e$ is a
possible sample from the environment). When the environment produces a sample $x$,
$M$ binary features $f_1(x), \cdots, f_M(x)$ are extracted from $x$ and loaded into the neurons
$1, \cdots, M$, respectively (a feature is a function $f_m : \{0,1\}^N \to \{0,1\}$). For a given network, the
features $f_1, \cdots, f_M$ are arbitrary but fixed, and $M$ (the number of neurons) can be much
larger than $N$ (the number of bits in a sample), e.g., $M$ can be superexponential in $N$.

As the samples from the unknown environment $e$ come in, each neuron sees the subset
of features carried by itself and its neighbors. Consider an arbitrary neuron that sees $K$
features (we will assume $K \le N \le M$ throughout), and relabel the neurons to make these
features $f_1, \cdots, f_K$. Based on the values $f_1, \cdots, f_K$ assume as $x$ varies over $e$, the neuron
is supposed to learn about the environment such that, after the learning phase is over, the
collective behaviour of the network is tuned to the environment $e$ that provided the samples.
How the neurons absorb the learning information and what computation the network is
supposed to perform eventually are left deliberately unspecified. The arguments in this
paper are based on the lack of information rather than the failure to use information.

The connectivity is measured by the parameter $K$. Since our result is asymptotic in $N$,
we will specify $K$ as a function of $N$: $K = \alpha N$, where $\alpha = \alpha(N)$ satisfies
$\lim_{N \to \infty} \alpha(N) = \alpha_0$ ($0 < \alpha_0 < 1$). To formalize the concept of unknown
environment, we will consider the ensemble of environments $\mathcal{E}$ of fixed entropy [Abu-Mostafa, 1986]

$$\mathcal{E} = \mathcal{E}(N) = \{\, e \subseteq \{0,1\}^N \mid |e| = h \,\}$$

where $h = 2^{\beta N}$ (the entropy is essentially $\log_2 h = \beta N$) and $\beta = \beta(N)$ satisfies
$\lim_{N \to \infty} \beta(N) = \beta_0$ ($0 < \beta_0 < 1$). The probability distribution on $\mathcal{E}$ is uniform; any
environment $e \in \mathcal{E}$ is as likely to occur as any other.
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The  neuron  sees  only  the  K  (fixed  but  arbitrary)  functions  /i,  ■  *  * ,  /x  of  each  x  gen¬ 
erated  by  the  environment  e.  For  each  e,  we  define  the  function  n  :  {O,  —*■  {0, 1, 2,  •  •  •} 

where 

n(oi---ojc)  =  |{xee  |  /*(x)  =  o*  for  *  = 
and  the  normalized  version 


i/(oi  •  •  •  ajc)  = - ^ - 

The  function  1/  describes  the  relative  frequency  of  occurrence  for  each  of  the  2^  binary 
vectors  /i(x)  ■  ■  ■  /k(x)  as  x  runs  through  all  h  vectors  in  e.  In  other  words,  1/  specifies  the 
nonlinear  projection  of  e  as  seen  by  the  neuron.  Clearly,  t/(a)  >  0  for  all  a  €  {0,  and 
E.e{o,i)*  *'(•)  =  1- 

Corresponding to two environments $e_1$ and $e_2$, we will have two functions $\nu_1$ and $\nu_2$.
If $\nu_1$ is not distinguishable from $\nu_2$, the neuron cannot tell the difference between $e_1$ and
$e_2$. The distinguishability between $\nu_1$ and $\nu_2$ can be measured by

$$d(\nu_1, \nu_2) = \frac{1}{2} \sum_{a \in \{0,1\}^K} |\nu_1(a) - \nu_2(a)|$$

The range of $d(\nu_1, \nu_2)$ is $0 \le d(\nu_1, \nu_2) \le 1$, where '0' corresponds to complete
indistinguishability while '1' corresponds to maximum distinguishability. The main result of this
paper is to relate this distinguishability to how the connectivity of the network compares
with the entropy of the environment.
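
The definitions of $n$, $\nu$ and $d$ can be made concrete on a toy case. The following Python sketch assumes the simple projection features (the first $K$ bits of $x$), samples encoded as integers, and three hand-picked environments with $N = 4$ and $K = 2$; these choices and all function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from fractions import Fraction

def projection_features(x, N, K):
    """Illustrative feature choice: the first K bits of the N-bit sample x."""
    return x >> (N - K)

def nu(env, feature_map):
    """Relative frequency nu(a) of each feature vector a over the environment."""
    counts = Counter(feature_map(x) for x in env)   # counts[a] is n(a)
    h = len(env)
    return {a: Fraction(c, h) for a, c in counts.items()}

def distinguishability(nu1, nu2):
    """d(nu1, nu2) = (1/2) * sum over a of |nu1(a) - nu2(a)|."""
    keys = set(nu1) | set(nu2)
    return sum(abs(nu1.get(a, 0) - nu2.get(a, 0)) for a in keys) / 2

N, K = 4, 2
f = lambda x: projection_features(x, N, K)
e1 = {0b0000, 0b0001, 0b0110, 0b1111}
e2 = {0b0010, 0b0011, 0b0100, 0b1101}   # differs from e1, same 2-bit prefixes
e3 = {0b1000, 0b1001, 0b1010, 0b1011}   # every sample has prefix 10

d12 = distinguishability(nu(e1, f), nu(e2, f))
d13 = distinguishability(nu(e1, f), nu(e3, f))
print(d12, d13)
```

In this example $e_1$ and $e_2$ are different sets with identical projections, so the neuron cannot separate them ($d = 0$), while $e_3$ projects onto a single vector that $e_1$ never produces ($d = 1$).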

II.3 MAIN RESULT

Let $e_1$ and $e_2$ be independently selected environments from $\mathcal{E}$ according to the uniform
probability distribution. $d(\nu_1, \nu_2)$ is now a random variable, and we are interested in the
expected value $E(d(\nu_1, \nu_2))$. The case where $E(d(\nu_1, \nu_2)) = 0$ corresponds to the neuron
getting no information about the environment, while the case where $E(d(\nu_1, \nu_2)) = 1$
corresponds to the neuron getting maximum information. $E(d(\nu_1, \nu_2))$ depends, among
other things, on the choice of the features $f_1, \ldots, f_K$. For example, a poor choice of the
$f_k$'s as constant functions forces $E(d(\nu_1, \nu_2))$ to be zero regardless of $K$. For which values
of $K$ does there exist a choice of the $f_k$'s that makes $E(d(\nu_1, \nu_2))$ close to 1, and for which
values is $E(d(\nu_1, \nu_2))$ close to 0 for all choices of the $f_k$'s? The theorem predicts these
extremes depending on how the connectivity (represented by $\alpha_0$ in the limit) compares
with the entropy (represented by $\beta_0$ in the limit).

Theorem. 

1. If $\alpha_0 > \beta_0$, then for every $N$ there exist functions $f_1, \ldots, f_K$ such that
$\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 1$.

2. If $\alpha_0 < \beta_0$, then for all functions $f_1, \ldots, f_K$ for all $N$,
$\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 0$.

Proof.

1. We shall take the functions $f_1, \ldots, f_K$ to be the simple projection functions
$f_k(x_1 \cdots x_k \cdots x_N) = x_k$. Thus the neuron sees the first $K$ bits $x_1 \cdots x_K$ of the sample
$x = x_1 \cdots x_N$. We start with some basic properties of the ensemble of environments
$\mathcal{E}$. Since the probability distribution on $\mathcal{E}$ is uniform and since $|\mathcal{E}| = \binom{2^N}{h}$, we have

$$\Pr(e) = \binom{2^N}{h}^{-1}$$

which is equivalent to generating $e$ by choosing $h$ elements $x \in \{0,1\}^N$ with uniform
probability (without replacement). It follows that

$$\Pr(x \in e) = \frac{h}{2^N}$$

while for $x_1 \ne x_2$,

$$\Pr(x_1 \in e,\ x_2 \in e) = \frac{h}{2^N} \times \frac{h - 1}{2^N - 1}$$

and so on.
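
The equivalence noted above, that a uniform draw from the ensemble amounts to choosing $h$ distinct elements of $\{0,1\}^N$ without replacement, gives a direct way to sample an environment. The Python sketch below is illustrative only: the encoding of samples as integers, the function name `sample_environment`, and the rounding of $h = 2^{\beta N}$ to an integer are assumptions of the sketch.

```python
import math
import random

def sample_environment(N, beta, rng):
    """Draw one environment uniformly from the ensemble: a set of h = 2^(beta*N)
    distinct N-bit strings, each encoded as an integer in [0, 2^N)."""
    h = round(2 ** (beta * N))           # |e|; rounding is an assumption of this sketch
    return frozenset(rng.sample(range(2 ** N), h))

rng = random.Random(0)                   # fixed seed, only for repeatability
e = sample_environment(N=10, beta=0.5, rng=rng)
print(len(e), math.log2(len(e)))         # size h and entropy log2(h)
```

Because `random.sample` selects without replacement, every subset of size $h$ is equally likely, which matches the uniform distribution on the ensemble.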

The functions $n$ and $\nu$ are defined on $K$-bit vectors. For the above choice of the
functions $f_1, \ldots, f_K$, the statistics of $n(a)$ (a random variable for fixed $a$) are independent of
$a$:

$$\Pr(n(\mathbf{a}_1) = m) = \Pr(n(\mathbf{a}_2) = m)$$

(here $\mathbf{a}_1$ and $\mathbf{a}_2$ denote any two $K$-bit vectors), which follows from the symmetry with
respect to each bit of $a$. The same holds for the statistics of $\nu(a)$. The expected value
$E(n(a)) = h 2^{-K}$ ($h$ objects going into $2^K$ cells), hence $E(\nu(a)) = 2^{-K}$.
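
For small parameters, the claim $E(n(a)) = h 2^{-K}$ under projection features can be checked exactly by enumerating the whole ensemble. The sketch below assumes toy values $N = 4$, $K = 2$, $h = 3$, chosen only so that all $\binom{16}{3} = 560$ environments can be listed; the names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

N, K, h = 4, 2, 3                              # toy sizes, so enumeration is cheap
envs = list(combinations(range(2 ** N), h))    # every environment of size h

def n(env, a):
    """Number of samples in env whose first K bits equal a."""
    return sum(1 for x in env if x >> (N - K) == a)

# Exact expected value of n(a) under the uniform distribution on the ensemble.
means = {a: Fraction(sum(n(e, a) for e in envs), len(envs)) for a in range(2 ** K)}
for a, mean in means.items():
    print(format(a, f"0{K}b"), mean)           # each mean should equal h / 2^K
```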

We expand $E(d(\nu_1, \nu_2))$ as follows

$$\begin{aligned}
E(d(\nu_1, \nu_2)) &= E\Big( \frac{1}{2} \sum_{a \in \{0,1\}^K} |\nu_1(a) - \nu_2(a)| \Big) \\
&= \frac{1}{2h} \sum_{a \in \{0,1\}^K} E(|n_1(a) - n_2(a)|) \\
&= \frac{2^K}{2h} E(|n_1 - n_2|)
\end{aligned}$$

where $n_1$ and $n_2$ denote $n_1(0 \cdots 0)$ and $n_2(0 \cdots 0)$, respectively, and the last step follows
from the fact that the statistics of $n_1(a)$ and $n_2(a)$ are independent of $a$. Therefore, to
prove the first part of the theorem, we assume $\alpha_0 > \beta_0$ and evaluate $E(|n_1 - n_2|)$ for large
$N$. Let $n$ denote $n(0 \cdots 0)$, and consider $\Pr(n = 0)$. For $n$ to be zero, all $2^{N-K}$ strings $x$
of $N$ bits starting with $K$ 0's must not be in the environment $e$. Hence


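$$\Pr(n = 0) = \Big(1 - \frac{h}{2^N}\Big)\Big(1 - \frac{h}{2^N - 1}\Big) \cdots \Big(1 - \frac{h}{2^N - 2^{N-K} + 1}\Big)$$

where the first term is the probability that $0 \cdots 00 \notin e$, the second term is the probability
that $0 \cdots 01 \notin e$ given that $0 \cdots 00 \notin e$, and so on. Each of the $2^{N-K}$ factors exceeds
$1 - h/(2^N - 2^{N-K})$, so

$$\begin{aligned}
\Pr(n = 0) &> \Big(1 - \frac{h}{2^N - 2^{N-K}}\Big)^{2^{N-K}} \\
&\ge (1 - 2h 2^{-N})^{2^{N-K}} \\
&\ge 1 - 2h 2^{-N} 2^{N-K} \\
&= 1 - 2h 2^{-K}
\end{aligned}$$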

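Hence, $\Pr(n_1 = 0) = \Pr(n_2 = 0) = \Pr(n = 0) > 1 - 2h 2^{-K}$. However, $E(n_1) = E(n_2) =
h 2^{-K}$. Therefore,

$$\begin{aligned}
E(|n_1 - n_2|) &= \sum_{i=0}^{h} \sum_{j=0}^{h} \Pr(n_1 = i,\ n_2 = j)\, |i - j| \\
&= \sum_{i=0}^{h} \sum_{j=0}^{h} \Pr(n_1 = i) \Pr(n_2 = j)\, |i - j| \\
&\ge \sum_{j=0}^{h} \Pr(n_1 = 0) \Pr(n_2 = j)\, j + \sum_{i=0}^{h} \Pr(n_1 = i) \Pr(n_2 = 0)\, i
\end{aligned}$$

which follows by throwing away all the terms where neither $i$ nor $j$ is zero (the term where
both $i$ and $j$ are zero appears twice for convenience, but this term is zero anyway). Thus

$$\begin{aligned}
E(|n_1 - n_2|) &\ge \Pr(n_1 = 0) E(n_2) + \Pr(n_2 = 0) E(n_1) \\
&> 2 (1 - 2h 2^{-K})\, h 2^{-K}
\end{aligned}$$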

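Substituting this estimate in the expression for $E(d(\nu_1, \nu_2))$, we get

$$\begin{aligned}
E(d(\nu_1, \nu_2)) &= \frac{2^K}{2h} E(|n_1 - n_2|) \\
&> \frac{2^K}{2h} \times 2 (1 - 2h 2^{-K})\, h 2^{-K} \\
&= 1 - 2h 2^{-K} \\
&= 1 - 2 \times 2^{(\beta - \alpha) N}
\end{aligned}$$

Since $\alpha_0 > \beta_0$ by assumption, this lower bound goes to 1 as $N$ goes to infinity. Since 1
is also an upper bound for $d(\nu_1, \nu_2)$ (and hence an upper bound for the expected value
$E(d(\nu_1, \nu_2))$), $\lim_{N \to \infty} E(d(\nu_1, \nu_2))$ must be 1.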
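
The bounds in part 1 can be checked exactly for small parameters. The sketch below evaluates the product formula for $\Pr(n = 0)$ with rational arithmetic and compares it with $1 - 2h 2^{-K}$, the quantity that bounds both $\Pr(n = 0)$ and $E(d(\nu_1, \nu_2))$ from below. The parameter values ($N = 12$, $h = 8$) and function names are assumptions made for illustration; the binomial-coefficient form is an equivalent restatement of the product, included only as a cross-check.

```python
from fractions import Fraction
from math import comb

def pr_n_zero(N, K, h):
    """Exact Pr(n = 0): none of the 2^(N-K) strings starting with K zeros is in e."""
    p = Fraction(1)
    for i in range(2 ** (N - K)):
        p *= 1 - Fraction(h, 2 ** N - i)
    return p

def pr_n_zero_binomial(N, K, h):
    """Same probability as a ratio of binomial coefficients (a cross-check)."""
    return Fraction(comb(2 ** N - 2 ** (N - K), h), comb(2 ** N, h))

def bound(K, h):
    """The quantity 1 - 2h/2^K, a lower bound on both Pr(n = 0) and E(d)."""
    return 1 - Fraction(2 * h, 2 ** K)

N, h = 12, 8                      # h = 2^(beta N) with beta = 1/4
rows = {}
for K in (6, 8, 10):              # alpha = K/N above beta in each case
    rows[K] = (pr_n_zero(N, K, h), bound(K, h))
    print(K, float(rows[K][0]), float(rows[K][1]))
```

As $K$ grows relative to $\log_2 h$, both the exact probability and the bound approach 1, which is the mechanism behind the limit in part 1.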


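2. Assume $\alpha_0 < \beta_0$, and consider arbitrary functions $f_1, \ldots, f_K$. Define

$$\bar{n}(a) = \frac{h}{2^N}\, |\{\, x \in \{0,1\}^N \;:\; f_k(x) = a_k \text{ for } k = 1, \ldots, K \,\}|$$

We expand $E(d(\nu_1, \nu_2))$ as follows

$$\begin{aligned}
E(d(\nu_1, \nu_2)) &= \frac{1}{2h} \sum_{a \in \{0,1\}^K} E(|n_1(a) - n_2(a)|) \\
&= \frac{1}{2h} \sum_{a \in \{0,1\}^K} E(|(n_1(a) - \bar{n}(a)) - (n_2(a) - \bar{n}(a))|) \\
&\le \frac{1}{2h} \sum_{a \in \{0,1\}^K} E(|n_1(a) - \bar{n}(a)| + |n_2(a) - \bar{n}(a)|) \\
&= \frac{1}{2h} \sum_{a \in \{0,1\}^K} \big[ E(|n_1(a) - \bar{n}(a)|) + E(|n_2(a) - \bar{n}(a)|) \big] \\
&= \frac{1}{h} \sum_{a \in \{0,1\}^K} E(|n(a) - \bar{n}(a)|)
\end{aligned}$$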

The statistics of $n(a)$ now depend on $a$ since the functions $f_1, \ldots, f_K$ are arbitrary. To
evaluate $E(|n(a) - \bar{n}(a)|)$, we first show that $\bar{n}(a) = E(n(a))$, then estimate the variance
of $n(a)$ and use the fact that $E(|n(a) - E(n(a))|) \le \sqrt{\mathrm{var}(n(a))}$. We write

$$n(a) = \sum_{x \in \{0,1\}^N} \delta(x, a)\, \delta(x)$$

where $\delta(x, a) = 1$ if $f_k(x) = a_k$ for $k = 1, \ldots, K$ and is zero otherwise, and $\delta(x) = 1$ if
$x \in e$ and is zero otherwise (while $\delta(x, a)$ is fixed for given $x$ and $a$, $\delta(x)$ is a random
variable for a given $x$). Hence

$$E(n(a)) = \sum_{x \in \{0,1\}^N} \delta(x, a)\, E(\delta(x))$$

The expected value of $\delta(x)$ is $\Pr(x \in e) = h/2^N$. Factoring this out, we are left with
$\sum_{x \in \{0,1\}^N} \delta(x, a)$, which equals $|\{\, x \in \{0,1\}^N : f_k(x) = a_k \text{ for } k = 1, \ldots, K \,\}|$; hence
$E(n(a))$ indeed equals $\bar{n}(a)$.

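Since $\mathrm{var}(n(a)) = E((n(a))^2) - (E(n(a)))^2$, we need an estimate for $E((n(a))^2)$.

$$E((n(a))^2) = E\Big( \sum_{x_1 \in \{0,1\}^N} \sum_{x_2 \in \{0,1\}^N} \delta(x_1, a)\, \delta(x_2, a)\, \delta(x_1)\, \delta(x_2) \Big)$$

For the 'diagonal' terms ($x_1 = x_2$), we get $\sum_x \delta(x, a) E(\delta(x))$ (since $\delta^2 = \delta$), which equals
$\bar{n}(a)$. For the 'off-diagonal' terms ($x_1 \ne x_2$), we get

$$\begin{aligned}
\sum_{x_1} \sum_{x_2 \ne x_1} \delta(x_1, a)\, \delta(x_2, a)\, E(\delta(x_1) \delta(x_2))
&= \sum_{x_1} \sum_{x_2 \ne x_1} \delta(x_1, a)\, \delta(x_2, a)\, \Pr(x_1 \in e,\ x_2 \in e) \\
&= \frac{h (h - 1)}{2^N (2^N - 1)} \sum_{x_1} \sum_{x_2 \ne x_1} \delta(x_1, a)\, \delta(x_2, a) \\
&= \frac{h (h - 1)}{2^N (2^N - 1)} \Big[ \Big( \sum_{x} \delta(x, a) \Big)^2 - \sum_{x} \delta(x, a) \Big]
\end{aligned}$$

The last step follows by adding and subtracting the missing terms of the double summation.
Noting that $\bar{n}(a) = \frac{h}{2^N} \sum_x \delta(x, a)$, this can be rewritten as

$$(\bar{n}(a))^2\, \frac{h - 1}{h}\, \frac{2^N}{2^N - 1} - \bar{n}(a)\, \frac{h - 1}{2^N - 1}$$

Putting the contributions from the diagonal and off-diagonal terms together, we get

$$E((n(a))^2) = \bar{n}(a) + (\bar{n}(a))^2\, \frac{h - 1}{h}\, \frac{2^N}{2^N - 1} - \frac{h - 1}{2^N - 1}\, \bar{n}(a)$$

$$\begin{aligned}
\mathrm{var}(n(a)) &= E((n(a))^2) - (E(n(a)))^2 \\
&= \bar{n}(a) \Big( 1 - \frac{h - 1}{2^N - 1} \Big) - (\bar{n}(a))^2 \Big( 1 - \frac{h - 1}{h}\, \frac{2^N}{2^N - 1} \Big) \\
&\le \bar{n}(a)
\end{aligned}$$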
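
The identities $E(n(a)) = \bar{n}(a)$ and $\mathrm{var}(n(a)) \le \bar{n}(a)$ hold for arbitrary features, and for small parameters they can be verified by exhaustive enumeration. The sketch below is illustrative: it assumes toy values $N = 4$, $K = 2$, $h = 3$, represents the arbitrary but fixed features by a random lookup table, and compares the enumerated variance with the closed form $\bar{n}(a)\big(1 - \frac{h-1}{2^N-1}\big) - (\bar{n}(a))^2\big(1 - \frac{h-1}{h}\frac{2^N}{2^N-1}\big)$ that follows from the second-moment calculation.

```python
import random
from fractions import Fraction
from itertools import combinations

N, K, h = 4, 2, 3                     # toy sizes, so all environments can be listed
T = 2 ** N
rng = random.Random(1)
# Arbitrary but fixed features: table[x] is the K-bit vector f_1(x)...f_K(x).
table = [rng.randrange(2 ** K) for _ in range(T)]
envs = list(combinations(range(T), h))

results = {}
for a in range(2 ** K):
    cell = sum(1 for x in range(T) if table[x] == a)      # number of x mapped to a
    n_bar = Fraction(h * cell, T)
    values = [sum(1 for x in e if table[x] == a) for e in envs]   # n(a) per environment
    mean = Fraction(sum(values), len(envs))
    var = Fraction(sum(v * v for v in values), len(envs)) - mean ** 2
    closed = (n_bar * (1 - Fraction(h - 1, T - 1))
              - n_bar ** 2 * (1 - Fraction((h - 1) * T, h * (T - 1))))
    results[a] = (n_bar, mean, var, closed)
    print(a, n_bar, mean, var, closed)
```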

Thus we have $E(|n(a) - \bar{n}(a)|) \le \sqrt{\mathrm{var}(n(a))} \le \sqrt{\bar{n}(a)}$. Now, we rewrite the estimate for
$E(d(\nu_1, \nu_2))$:

$$E(d(\nu_1, \nu_2)) \le \frac{1}{h} \sum_{a \in \{0,1\}^K} E(|n(a) - \bar{n}(a)|) \le \frac{1}{h} \sum_{a \in \{0,1\}^K} \sqrt{\bar{n}(a)}$$

The values of the individual $\bar{n}(a)$ will depend on the choice of $f_1, \ldots, f_K$. However,
$\sum_{a \in \{0,1\}^K} \bar{n}(a)$ always equals $h$ (from the definition of $\bar{n}(a)$). Therefore, one can obtain
an upper bound for $E(d(\nu_1, \nu_2))$ by maximizing $\sum_{a \in \{0,1\}^K} \sqrt{\bar{n}(a)}$ subject to
$\sum_{a \in \{0,1\}^K} \bar{n}(a) = h$. The maximum occurs when all $\bar{n}(a)$ are equal ($= h 2^{-K}$). Hence,
$E(d(\nu_1, \nu_2)) \le \frac{1}{h}\, 2^K \sqrt{h 2^{-K}} = 2^{\frac{1}{2}(\alpha - \beta) N}$. Since $\alpha_0 < \beta_0$ by assumption, this
upper bound goes to 0 as $N$ goes to infinity. Since 0 is also a lower bound for $d(\nu_1, \nu_2)$
(and hence a lower bound for the expected value $E(d(\nu_1, \nu_2))$), $\lim_{N \to \infty} E(d(\nu_1, \nu_2))$ must
be 0. ∎
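
The two regimes of the theorem can also be explored numerically. The following Monte Carlo sketch is illustrative only: it assumes projection features in both regimes (part 2 holds for all features, so this is just one instance), the values $N = 12$ and 200 trials are arbitrary, and the estimates carry sampling noise. It prints each estimate of $E(d(\nu_1, \nu_2))$ next to the corresponding bound from the proof, $1 - 2h 2^{-K}$ for part 1 and $\frac{1}{h} 2^K \sqrt{h 2^{-K}} = \sqrt{2^K / h}$ for part 2.

```python
import math
import random

def prefix_counts(env, N, K):
    """n(a) for projection features: number of samples per K-bit prefix."""
    counts = [0] * (2 ** K)
    for x in env:
        counts[x >> (N - K)] += 1
    return counts

def estimate_expected_d(N, K, h, trials, rng):
    """Monte Carlo estimate of E(d) over independent uniform pairs of environments."""
    total = 0.0
    for _ in range(trials):
        n1 = prefix_counts(rng.sample(range(2 ** N), h), N, K)
        n2 = prefix_counts(rng.sample(range(2 ** N), h), N, K)
        total += sum(abs(p - q) for p, q in zip(n1, n2)) / (2 * h)
    return total / trials

rng = random.Random(0)                 # fixed seed, only for repeatability
N = 12
# Part 1 regime: alpha = 10/12 exceeds beta = 3/12.
est_hi = estimate_expected_d(N, K=10, h=2 ** 3, trials=200, rng=rng)
print(est_hi, 1 - 2 * 2 ** 3 / 2 ** 10)       # estimate, lower bound on E(d)
# Part 2 regime: alpha = 3/12 is below beta = 9/12.
est_lo = estimate_expected_d(N, K=3, h=2 ** 9, trials=200, rng=rng)
print(est_lo, math.sqrt(2 ** 3 / 2 ** 9))     # estimate, upper bound on E(d)
```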

II.4 CONCLUSION

We  have  shown  that,  under  the  assumption  of  local  learning,  each  neuron  must  have  at 
least  a  certain  number  of  synapses  in  order  to  be  able  to  distinguish  between  environments 
based  on  the  statistics  of  information  it  sees.  While  the  result  is  expressed  as  a  limit, 
it  is  seen  in  the  proof  that  the  rate  of  convergence  to  this  limit  is  exponential  in  N,  the 
dimensionality of the problem. Further work should address the weakening of the
local-learning assumption, perhaps by restricting the amount of global information flow or by
restricting the ability of the neuron to make use of the information it sees (e.g., by modeling
its learning mechanism as a finite-state machine).